1 Climate change and temperature anomalies

If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.

To define temperature anomalies you need a reference, or base, period, which NASA states is 1951-1980.
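To make the definition concrete, here is a toy illustration with made-up numbers (not NASA’s): an anomaly is simply the deviation of an observation from the base-period mean.

base_period_temps <- c(13.9, 14.1, 14.0)  # hypothetical base-period (1951-1980) values
base_mean <- mean(base_period_temps)      # mean of the base period: 14.0
observed <- 14.9                          # a hypothetical later observation
observed - base_mean                      # anomaly of +0.9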

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")
tidyweather <- weather %>%
  select(-c("J-D", "D-N", "DJF", "MAM", "JJA", "SON")) %>% # select relevant columns
  pivot_longer(cols = 2:13, names_to = "Month", values_to = "delta") # make data frame tidy

1.1 Plotting Information

Let us plot the data using a time-series scatter plot and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plotted chronologically.

In the following chunk of code, I used the eval=FALSE argument, which does not run a chunk of code; I did this so that you can knit the document before tidying the data and creating a new dataframe tidyweather. When you actually want to run this code and knit your document, you must delete eval=FALSE, not just here but in all chunks where eval=FALSE appears.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date), #had to get rid of label = true. TA advised could be a package loading problem
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies drastically increasing over time"
  )

Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a separate scatter plot for each month, again with a smoothing line. Your chart should have human-readable labels; that is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.

  • We can see that the weather anomalies are more pronounced in the winter months!
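Since month(date, label = TRUE) was dropped in an earlier chunk, one possible sketch of the faceted plot (assuming the tidyweather dataframe created above, with its numeric month column) builds the abbreviated month labels from base R’s month.abb instead:

tidyweather %>%
  # turn the numeric month into an ordered factor labelled "Jan"..."Dec"
  mutate(month_name = factor(month.abb[month], levels = month.abb)) %>%
  ggplot(aes(x = date, y = delta)) +
  geom_point() +
  geom_smooth(color = "red") +
  facet_wrap(~month_name) +
  theme_bw() +
  labs(title = "Temperature anomalies by month")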

It is sometimes useful to group data into different time periods to study historical data. For example, we often refer to decades such as the 1970s, 1980s, 1990s etc. to refer to a period of time. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called comparison that groups data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1881 using filter(). Then, we use the mutate() function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

Inspect the comparison dataframe by clicking on it in the Environment pane.
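If you are not using RStudio’s Environment pane, a couple of console calls (assuming the comparison dataframe created above) give the same quick overview:

glimpse(comparison)          # column names, types, and a preview of values
count(comparison, interval)  # number of monthly observations in each period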

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.

ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +   #density plot with transparency set to 20%
  theme_bw() +                #theme
  labs (
    title = "Density Plot for Monthly Temperature Anomalies",
    y     = "Density"         #changing y-axis label 
  )

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(annual_average_delta = mean(delta, na.rm=TRUE)) 

#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth() +
  
  #change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Delta"
  )                         

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

formula_ci <- comparison %>% 
  # clean NAs and choose the interval 2011-present
  drop_na(delta) %>% 
  filter(Year >= 2011) %>% 
  # calculate yearly mean temperature deviation (delta) 
  group_by(Year) %>% 
  summarise(year_mean_delta = mean(delta)) %>%    
  
  # Confidence Interval (CI) using the formula mean +- MoE
  summarise(mean_delta = mean(year_mean_delta), # calculate summary statistics for yearly mean temperature deviation (delta) 
            sd_delta = sd(year_mean_delta),
            count = n(),
            t_critical = qt(0.975, count-1),  # get t-critical value with (n-1) degrees of freedom
            se_delta = sd(year_mean_delta)/sqrt(count), # calculate mean, SD, count, SE, lower/upper 95% CI
            margin_of_error = t_critical * se_delta,
            delta_low = mean_delta - margin_of_error,
            delta_high = mean_delta + margin_of_error)


formula_ci
## # A tibble: 1 x 8
##   mean_delta sd_delta count t_critical se_delta margin_of_error delta_low
##        <dbl>    <dbl> <int>      <dbl>    <dbl>           <dbl>     <dbl>
## 1      0.973    0.204     9       2.31   0.0680           0.157     0.816
## # … with 1 more variable: delta_high <dbl>
set.seed(1234)

boot_yearly_mean_delta <- comparison %>%
  
  # Get rid of NAs in 2019
  drop_na(delta) %>% 
  
  # Choose only  2011 and following
  filter(Year >= 2011) %>%
  
  # Create yearly mean deltas
  group_by(Year) %>% 
  summarise(year_mean_delta = mean(delta)) %>%
  
  # Specify the variable of interest
  specify(response = year_mean_delta) %>%
  
  # Generate a bunch of bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  
  # Find the mean of each sample
  calculate(stat = "mean")

# Calculate bootstrap method confidence intervals
percentile_ci <- boot_yearly_mean_delta %>% 
  get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci
## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.855     1.10

What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!

In the sections above, the confidence interval has been calculated using two different methods: the CI formula and the bootstrap. In the formula method, each year’s annual delta is calculated by taking the arithmetic mean of all monthly deltas in that year. This yields a total of 9 observations from 2011 until 2019. Then, the standard deviation of the data and the standard error of the mean are calculated and combined with a t-statistic to obtain a 95% confidence interval for the average annual delta. In the bootstrap method, repeated samples are taken from the data to provide an estimate of the average annual delta. Using the first method, we are 95% confident that the actual average is between 0.816 and 1.13. Using the second method, the confidence interval is narrower: 0.855 to 1.10, see the first plot below. Here, the bootstrap distribution is quite close to a normal distribution, as the second plot below demonstrates, so the two confidence intervals should be fairly similar.

# Visualise bootstrap CI vs formula CI
visualize(boot_yearly_mean_delta) +
  shade_ci(endpoints = percentile_ci,fill = "yellow")+
  labs(title = 'Bootstrap CI for Yearly Mean Temperature Deviation narrower than Formula CI',
       subtitle = 'Formula CI shown with dashed red lines, Bootstrap CI shaded in yellow')+
  geom_vline(xintercept = formula_ci$delta_low, colour = "red", linetype="dashed", size=1.2)+
  geom_vline(xintercept = formula_ci$delta_high, colour = "red", linetype="dashed", size=1.2)+
  theme_bw()+
  NULL

# compare bootstrap distribution with a Normal distribution with parameters estimated from the sample
ggplot(boot_yearly_mean_delta, aes(x = stat)) +
  geom_density(color="blue") +
  stat_function(
    fun = dnorm,
    color = "red",
    size = 2,
    args = list(mean = formula_ci$mean_delta, sd = formula_ci$se_delta)
  )+
  theme_bw()+
  labs(title = "The Bootstrap distribution is close to a Normal distribution",
       x = "Bootstrapped yearly mean temperature deviation", y = "")+
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())+
  NULL

2 General Social Survey (GSS)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends in attitudes, behaviours, and attributes. Many trends have been tracked for decades, so one can see the evolution of attitudes, etc. in American society.

In this assignment we analyze data from the 2016 GSS sample data, using it to estimate values of population parameters of interest about US adults. The GSS sample data file has 2867 observations of 935 variables, but we are only interested in very few of these variables and you are using a smaller file.

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable"))

You will also notice that many responses should not be taken into consideration, like “No Answer”, “Don’t Know”, “Not applicable”, “Refused to Answer”.

We will be creating 95% confidence intervals for population parameters. The variables we have are the following:

  • hours and minutes spent on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.50 hours, this would be recorded as emailhr = 2 and emailmin = 30.
  • snapchat, instagrm, twitter: whether respondents used these social media in 2016
  • sex: Female - Male
  • degree: highest education level attained

2.1 Instagram and Snapchat, by sex

Can we estimate the population proportion of Snapchat or Instagram users in 2016?

  1. Create a new variable, snap_insta that is Yes if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and No if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.
  2. Calculate the proportion of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.
  3. Using the CI formula for proportions, please construct 95% CIs for men and women who used either Snapchat or Instagram
gss_new <- gss %>%
  mutate(snap_insta = if_else(snapchat == "Yes", "Yes", if_else(instagrm == "Yes", "Yes", "No", "NA"), "NA")) %>%
  mutate(snap_insta = na_if(snap_insta, "NA"))  # Assign NA values (and not just "NA" characters) to missing data

Insta_snap <- gss_new %>%
  group_by(snap_insta) %>%
  filter(!is.na(snap_insta)) %>% # NAs excluded
  summarise(count = n()) %>%
  mutate(usage_proportion = count/sum(count)) %>%
  filter(snap_insta == "Yes")

CI_by_sex <- gss_new %>%
  filter(!is.na(snap_insta)) %>%
  group_by(sex, snap_insta) %>%
  summarise(count = n()) %>%
 # create statistics required to build CIs
  mutate(usage_proportion = count/sum(count),
         t_critical = qt(0.975, sum(count)-1),
         se_proportion = sqrt(usage_proportion * (1 - usage_proportion)/sum(count)),
         margin_of_error = t_critical * se_proportion,
         # calculate CI
         proportion_low = usage_proportion - margin_of_error,
         proportion_high = usage_proportion + margin_of_error) %>%
# Shorten data frame to only include relevant information
  filter(snap_insta == "Yes") %>%
  select(sex, usage_proportion, proportion_low, proportion_high) 

gss_new
## # A tibble: 2,867 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree         snap_insta
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <chr>          <chr>     
##  1 0        12      NA       NA       NA      Male   Bachelor       <NA>      
##  2 30       0       No       No       No      Male   High school    No        
##  3 NA       NA      No       No       No      Male   Bachelor       No        
##  4 10       0       NA       NA       NA      Female High school    <NA>      
##  5 NA       NA      Yes      Yes      No      Female Graduate       Yes       
##  6 0        2       No       Yes      No      Female Junior college No        
##  7 0        40      NA       NA       NA      Male   High school    <NA>      
##  8 NA       NA      Yes      Yes      No      Female High school    Yes       
##  9 0        0       NA       NA       NA      Male   High school    <NA>      
## 10 NA       NA      No       No       No      Male   Junior college No        
## # … with 2,857 more rows
Insta_snap
## # A tibble: 1 x 3
##   snap_insta count usage_proportion
##   <chr>      <int>            <dbl>
## 1 Yes          311            0.227
CI_by_sex
## # A tibble: 2 x 4
## # Groups:   sex [2]
##   sex    usage_proportion proportion_low proportion_high
##   <chr>             <dbl>          <dbl>           <dbl>
## 1 Female            0.234          0.204           0.264
## 2 Male              0.217          0.184           0.250

2.2 Twitter, by education level

Can we estimate the population proportion of Twitter users by education level in 2016?

There are 5 education levels in variable degree which, in ascending order of years of education, are Lt high school, High school, Junior college, Bachelor, Graduate.

  1. Turn degree from a character variable into a factor variable. Make sure the order is the correct one and that levels are not sorted alphabetically which is what R by default does.
  2. Create a new variable, bachelor_graduate that is Yes if the respondent has either a Bachelor or Graduate degree. As before, if the recorded value for either was NA, the value in your new variable should also be NA.
  3. Calculate the proportion of bachelor_graduate who do (Yes) and who don’t (No) use twitter.
  4. Using the CI formula for proportions, please construct two 95% CIs for bachelor_graduate vs whether they use (Yes) and don’t (No) use twitter.
  5. Do these two Confidence Intervals overlap?
# Turn degree into a factor with levels in ascending order of education (by default R would sort them alphabetically)
gss$degree <- factor(gss$degree, levels = c("Lt high school", "High school", "Junior college", "Bachelor", "Graduate"))

gss_edu <- gss %>%
  mutate(bachelor_graduate = if_else(degree %in% c("Bachelor", "Graduate"), "Yes", "No", "NA"),
         bachelor_graduate = na_if(bachelor_graduate, "NA")) # Change entries in bachelor_graduate columns with "NA" characters into NA values

twitter_usage <- gss_edu %>% 
  filter(bachelor_graduate == "Yes",
         twitter != "NA") %>%
  group_by(twitter) %>%
  summarise(count = n()) %>%
  mutate(usage_proportion = count/sum(count))

CI_twitter_by_edu <- gss_edu %>%
  filter(twitter != "NA",
         !is.na(bachelor_graduate)) %>%
  group_by(bachelor_graduate, twitter) %>%
  summarise(count = n()) %>%
  mutate(usage_proportion = count/sum(count),
         t_critical = qt(0.975, sum(count)-1),
         se_proportion = sqrt(usage_proportion * (1 - usage_proportion)/sum(count)),
         margin_of_error = t_critical * se_proportion,
         proportion_low = usage_proportion - margin_of_error,
         proportion_high = usage_proportion + margin_of_error) %>% 
  filter(twitter == "Yes") # As these are proportions, "No" is by definition, 1 - "Yes". Thus, only one is needed for the comparison.

gss_edu
## # A tibble: 2,867 x 8
##    emailmin emailhr snapchat instagrm twitter sex    degree     bachelor_gradua…
##    <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <fct>      <chr>           
##  1 0        12      NA       NA       NA      Male   Bachelor   Yes             
##  2 30       0       No       No       No      Male   High scho… No              
##  3 NA       NA      No       No       No      Male   Bachelor   Yes             
##  4 10       0       NA       NA       NA      Female High scho… No              
##  5 NA       NA      Yes      Yes      No      Female Graduate   Yes             
##  6 0        2       No       Yes      No      Female Junior co… No              
##  7 0        40      NA       NA       NA      Male   High scho… No              
##  8 NA       NA      Yes      Yes      No      Female High scho… No              
##  9 0        0       NA       NA       NA      Male   High scho… No              
## 10 NA       NA      No       No       No      Male   Junior co… No              
## # … with 2,857 more rows
twitter_usage
## # A tibble: 2 x 3
##   twitter count usage_proportion
##   <chr>   <int>            <dbl>
## 1 No        375            0.767
## 2 Yes       114            0.233
CI_twitter_by_edu
## # A tibble: 2 x 9
## # Groups:   bachelor_graduate [2]
##   bachelor_gradua… twitter count usage_proportion t_critical se_proportion
##   <chr>            <chr>   <int>            <dbl>      <dbl>         <dbl>
## 1 No               Yes       141            0.160       1.96        0.0123
## 2 Yes              Yes       114            0.233       1.96        0.0191
## # … with 3 more variables: margin_of_error <dbl>, proportion_low <dbl>,
## #   proportion_high <dbl>

The confidence intervals between the less-educated and more-educated groups do not overlap. In fact, the data suggest that we can be 95% confident that those with bachelor degrees or above are more likely to use twitter than those without.

2.3 Email usage

Can we estimate the population parameter on time spent on email weekly?

  1. Create a new variable called email that combines emailhr and emailmin to report the number of minutes the respondents spend on email weekly.
  2. Visualise the distribution of this new variable. Find the mean and the median number of minutes respondents spend on email weekly. Is the mean or the median a better measure of the typical amount of time Americans spend on email weekly? Why?
  3. Using the infer package, calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly. Interpret this interval in context of the data, reporting its endpoints in “humanized” units (e.g. instead of 108 minutes, report 1 hr and 8 minutes). If you get a result that seems a bit odd, discuss why you think this might be the case.
  4. Would you expect a 99% confidence interval to be wider or narrower than the interval you calculated above? Explain your reasoning.
gss_email <- gss %>% 
  mutate(email = as.numeric(emailhr) * 60 + as.numeric(emailmin)) # Combine hours and minutes into minutes-only format

# visualise our data
ggplot(gss_email, aes(x = email)) +
  geom_density() +
  labs(title = "Distribution of weekly minutes spent on emails",
       subtitle = "Data is heavily right-skewed",
       x = "Minutes per week",
       y = "Distribution density") +
  NULL

skim(gss_email$email)
Data summary
Name: gss_email$email
Number of rows: 2867
Number of columns: 1
Column type frequency: numeric (1)
Group variables: None

Variable type: numeric

skim_variable n_missing complete_rate mean  sd p0 p25 p50 p75 p100 hist
data               1218          0.58  417 680  0  50 120 480 6000 ▇▁▁▁▁

Here, the data is heavily right-skewed. The mean incorporates every value, including the large outliers in the right tail, which pull it well above the typical response; the median is robust to these outliers, so it is the better measure of the typical amount of time Americans spend on email weekly.

set.seed(1234)

boot_email <- gss_email %>%
  
  # Get rid of NAs
  drop_na(email) %>% 
  
  # Specify the variable of interest
  specify(response = email) %>%
  
  # Generate a bunch of bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  
  # Find the mean of each sample
  calculate(stat = "mean") 

#Calculate bootstrap method confidence intervals
email_CI_95 <- boot_email %>%
  get_confidence_interval(level = 0.95, type = "percentile") %>% 
# Change from minutes-only format to display Hours and minutes in the same entry
  mutate(Lower_hour = floor(lower_ci/60),
         Lower_minutes = round((lower_ci - 60 * Lower_hour), digits = 0),
         Upper_hour = floor(upper_ci/60),
         Upper_minutes = round((upper_ci - 60 * Upper_hour), digits = 0),
         Lower = paste(Lower_hour, "hr", Lower_minutes, "mins"),
         Upper = paste(Upper_hour, "hr" , Upper_minutes, "mins")) %>%
  select(Lower, Upper)

email_CI_95
## # A tibble: 1 x 2
##   Lower        Upper       
##   <chr>        <chr>       
## 1 6 hr 25 mins 7 hr 33 mins

We are 95% confident that the mean weekly time Americans spend on email is between 6 hr 25 mins and 7 hr 33 mins.

email_CI_99 <- boot_email %>% 
  get_confidence_interval(level = 0.99, type = "percentile") %>%
  mutate(Lower_hour = floor(lower_ci/60),
         Lower_minutes = round((lower_ci - 60 * Lower_hour), digits = 0),
         Upper_hour = floor(upper_ci/60),
         Upper_minutes = round((upper_ci - 60 * Upper_hour), digits = 0),
         Lower = paste(Lower_hour, "hr", Lower_minutes, "mins"),
         Upper = paste(Upper_hour, "hr" , Upper_minutes, "mins")) %>%
  select(Lower, Upper)


email_CI_99
## # A tibble: 1 x 2
##   Lower        Upper       
##   <chr>        <chr>       
## 1 6 hr 15 mins 7 hr 45 mins

As expected, the 99% confidence interval is wider than the 95% interval. The reason is that with a 99% confidence interval, we can be more confident that the actual mean is within the calculated interval, as compared to the 95% case. For this to be the case, the calculated range should be wider to incorporate additional values that would otherwise be excluded in the 95% confidence interval.

3 Trump’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval.

# Import approval polls data
approval_polllist <- read_csv(here::here('data', 'approval_polllist.csv'))

# or directly off fivethirtyeight website
# approval_polllist <- read_csv('https://projects.fivethirtyeight.com/trump-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 15,619
## Columns: 22
## $ president           <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls…
## $ modeldate           <chr> "9/27/2020", "9/27/2020", "9/27/2020", "9/27/2020…
## $ startdate           <chr> "1/20/2017", "1/20/2017", "1/20/2017", "1/21/2017…
## $ enddate             <chr> "1/22/2017", "1/22/2017", "1/24/2017", "1/23/2017…
## $ pollster            <chr> "Gallup", "Morning Consult", "Ipsos", "Gallup", "…
## $ grade               <chr> "B", "B/C", "B-", "B", "B-", "C+", "B+", "B", "C+…
## $ samplesize          <dbl> 1500, 1992, 1632, 1500, 1651, 1500, 1190, 1500, 1…
## $ population          <chr> "a", "rv", "a", "a", "a", "lv", "rv", "a", "lv", …
## $ weight              <dbl> 0.262, 0.680, 0.153, 0.243, 0.142, 0.200, 1.514, …
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ approve             <dbl> 45.0, 46.0, 42.1, 45.0, 42.3, 57.0, 36.0, 46.0, 5…
## $ disapprove          <dbl> 45.0, 37.0, 45.2, 46.0, 45.8, 43.0, 44.0, 45.0, 4…
## $ adjusted_approve    <dbl> 45.7, 45.3, 43.2, 45.7, 43.4, 51.5, 37.6, 46.7, 5…
## $ adjusted_disapprove <dbl> 43.6, 38.3, 43.9, 44.6, 44.5, 44.5, 42.8, 43.6, 4…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tracking            <lgl> TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE,…
## $ url                 <chr> "http://www.gallup.com/poll/201617/gallup-daily-t…
## $ poll_id             <dbl> 49253, 49249, 49426, 49262, 49425, 49266, 49260, …
## $ question_id         <dbl> 77265, 77261, 77599, 77274, 77598, 77278, 77272, …
## $ createddate         <chr> "1/23/2017", "1/23/2017", "3/1/2017", "1/24/2017"…
## $ timestamp           <chr> "00:45:20 27 Sep 2020", "00:45:20 27 Sep 2020", "…
# Use `lubridate` to fix dates, as they are given as characters.

3.1 Create a plot

We will calculate the average net approval rate (approve - disapprove) for each week since he got into office, plotting the net approval along with its 95% confidence interval. There are various dates given for each poll; we will use enddate, i.e., the date the poll ended.

We are faceting by year and adding an orange line at zero.

# Clean data set and prepare to analyse on weekly basis and across years
approval_polllist_clean <- approval_polllist %>% 
  mutate(enddate = mdy(enddate)) %>% # Use `lubridate` to fix dates, as they are given as characters.
  mutate(week_count = week(enddate)) %>% # get week counts
  mutate(year = year(enddate)) %>% # get year for facetting
  filter(subgroup == "Voters") # Keeping only "Voters" as the subgroup 

# Adding net rate to the data set as %
approval_polllist_clean <- approval_polllist_clean %>% 
  mutate(net_rate = (adjusted_approve - adjusted_disapprove) / (adjusted_approve + adjusted_disapprove) * 100) 

# Calculating average net rate, SD, SE & DF to create CIs on a weekly basis
weekly_approval_polllist <- approval_polllist_clean %>% 
  group_by(year, week_count) %>% 
  summarise(count = n(),
            MEAN = mean(net_rate),
            SD = sd(net_rate), 
            SE = SD / sqrt(count), 
            DF = count - 1) %>% 
  mutate(CI_upper = MEAN + qt(.975, DF) * SE, 
         CI_lower = MEAN - qt(.975, DF) * SE)

# colour hex codes from: https://html-color-codes.info/colors-from-image/
graph_colouring <- c("#F8766D" ,"#7CAE00", "#00BFC4", "#C77CFF")

#Plotting the data
approval_plot <- ggplot(weekly_approval_polllist, aes(x = week_count, y = MEAN, color = as.factor(year))) + 
  geom_line() +  
  facet_wrap(~year) + 
  geom_point(size = 1) +
  geom_hline(yintercept = 0, color = "orange") + 
  scale_x_continuous (limits = c(0, 53),
                      breaks = c(0, 13, 26, 39, 52),
                      labels = c("0", "13","26","39","52")) + 
  scale_y_continuous (limits=c(-22, 9),
                      breaks=c(-20, -17.5, -15, -12.5, -10, -7.5, -5, -2.5, 0, 2.5, 5, 7.5)) +
  geom_ribbon(aes(ymin = CI_lower, ymax = CI_upper, fill = as.factor(year)), alpha = .1) +
  labs(y = "Average Net Approval (%)", 
       x = "Week of the year") + 
  ggtitle(label = "Estimating Net Approval (approve - disapprove) for Donald Trump", 
          subtitle = "Weekly average of all polls") +
  theme(title = element_text(size=8),
        axis.title = element_text(size=8),
        axis.text = element_text(size=8),
        axis.ticks = element_blank(),
        strip.text = element_text(size=8),
        panel.background  = element_rect(color="black", fill = "white"),
        panel.border = element_blank(),
        strip.background = element_rect(color="black", fill="grey", size=.25),
        panel.grid = element_line(color = "#CCCCCC"),
        legend.position = "none") +
  scale_colour_manual(values = graph_colouring) +
  #coord_fixed(ratio= 3/5, clip = "on") + 
  theme(plot.margin = unit(c(1,1,0,1), "cm")) 

ggsave("plot1.png",
       plot = last_plot(),
       scale = 1,
       width = 25,
       height = 12.5,
       units = "cm",
       dpi = 300,
       limitsize = TRUE)

knitr::include_graphics(here::here("plot1.png"), error = FALSE)

3.2 Compare Confidence Intervals

What’s going on with the confidence intervals between week 15 (6-12 April 2020) and week 34 (17-23 August 2020)?

The 95% confidence interval for week 15 (-9.56, -6.45) is relatively narrower than the interval for week 34 (-13.69, -9.68), implying a tighter cluster of polling data near the mean. This translates to a higher proportion of people sticking to their approval rating in week 15: the SD of the polls is lower. This compares to week 34, where the mean approval ratings were generally lower but on a more volatile upward trend. The volatility might be due to the fact that week 34 contained multiple polarising events: both Obamas publicly criticised Trump, Fox News praised Joe Biden’s latest speech, and Steve Bannon was indicted. Such events will have angered many undecided voters, causing their disapproval to rise, but may also have been received as unfair by hard-core Trump supporters, ultimately increasing polling volatility. Finally, the confidence intervals for the two weeks do not overlap, so we can be 95% confident that there has been a shift in the population’s approval rating.

4 Gapminder revisited

Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. In this part, we will join a few dataframes with more data than the gapminder package. Specifically, we will look at data on HIV prevalence, life expectancy, fertility, primary school enrolment, GDP per capita, and under-5 mortality.

We will use the wbstats package to download data from the World Bank. The relevant World Bank indicators are SP.DYN.TFRT.IN (fertility rate), SE.PRM.NENR (primary school enrolment), NY.GDP.PCAP.KD (GDP per capita), and SH.DYN.MORT (under-5 mortality).

# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))

# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")


library(wbstats)

worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
                          indicator = indicators, 
                          start_date = 1960, 
                          end_date = 2016)

# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels,  from the World Bank API 
countries <-  wbstats::wb_cachelist$countries

We will join the three data frames (life_expectancy, worldbank_data, and hiv) into one. You may need to tidy your data first and then perform the join operations. Think about which type of join makes the most sense, and explain why you chose it.
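A left join is a defensible choice here because an inner join would silently drop any country-year missing from either side. A toy example (hypothetical table and column names) shows the difference:

```r
library(dplyr)

wb  <- tibble(country = c("Angola", "Benin", "Chad"),
              Year = 2000, gdpPerCap = c(3.2, 1.1, 1.5))
hiv <- tibble(country = c("Angola", "Benin"),
              Year = 2000, HIV_rate_per_100 = c(2.1, 0.8))

left_join(wb, hiv, by = c("country", "Year"))   # 3 rows; Chad keeps its GDP, gets NA for HIV
inner_join(wb, hiv, by = c("country", "Year"))  # 2 rows; Chad is dropped entirely
```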

# clean to longer tidy data
hiv_cleaned <- hiv %>%
  select(country, c(13:34)) %>%
  pivot_longer(cols = 2:23, names_to = "Year", values_to = "HIV_rate_per_100") %>%
  mutate(Year = as.numeric(Year)) # create numeric year value

# clean to longer tidy data
life_expectancy_cleaned <- life_expectancy %>%
  select(country, c(192:213)) %>%
  pivot_longer(cols = 2:23, names_to = "Year", values_to = "life_expectancy") %>%
  mutate(Year = as.numeric(Year)) # create numeric year value
  

worldbank_cleaned <- worldbank_data %>%
  # choose only years for which all to be merged data frames have data
  filter(date >= 1990, 
         date <= 2011) %>%
  rename(Year = date) %>%
  select(c(3:8)) %>%
  rename(gdpPerCap = NY.GDP.PCAP.KD,
         school_enrollment_rate = SE.PRM.NENR,
         mortality_rate_under5 = SH.DYN.MORT,
         fertility_rate = SP.DYN.TFRT.IN)

# Merge all data sets
compiled_countries_data <- worldbank_cleaned %>%
  left_join(hiv_cleaned, by = c("country", "Year")) %>%
  left_join(life_expectancy_cleaned, by = c("country", "Year")) %>%
  left_join(countries, by = "country") %>%
  select(region, country, Year, gdpPerCap, school_enrollment_rate, mortality_rate_under5, fertility_rate, HIV_rate_per_100, life_expectancy)
  • The hiv and life_expectancy data frames are reshaped with pivot_longer to match the format of worldbank_data. The longer format is chosen because it allows multiple indicators to be recorded over time, whereas the other two data frames can each hold only one indicator. In addition, the region column from the countries data frame has been merged in, so that the data can be faceted by region where necessary.
  • Moreover, the time ranges of the data frames have been synchronised to ensure consistency. Whilst hiv, the data frame with the shortest time span, ranges from 1979 to 2011, HIV data are missing for almost all countries in the earlier years; for most countries, data were only recorded from 1990. For this reason, all data frames have been filtered to entries between 1990 and 2011.
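One caveat when joining by country name: spellings differ across sources, and unmatched rows silently become NA in a left join. A quick check with anti_join (shown here on toy tables; the same pattern can be run on the actual data frames) reveals which names failed to match:

```r
library(dplyr)

# Toy tables with deliberately mismatched country spellings
wb_names  <- tibble(country = c("Korea, Rep.", "Viet Nam", "Chad"))
hiv_names <- tibble(country = c("South Korea", "Vietnam", "Chad"))

# Rows in wb_names with no matching country in hiv_names
anti_join(wb_names, hiv_names, by = "country")
```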

4.1 What is the relationship between HIV prevalence and life expectancy?

hiv_life <- compiled_countries_data %>%
  group_by(Year, region) %>%
  drop_na(HIV_rate_per_100, life_expectancy) %>%
  summarise(correlation = cor(HIV_rate_per_100, life_expectancy)) %>%
  filter(!region %in% c("North America", "Middle East & North Africa")) # Graph will be faceted by region, so these 2 regions are excluded because there are not enough samples. (`!=` with a length-2 vector would recycle and drop the wrong rows.)

# Plot correlation coefficient - time graph instead of HIV - life expectancy graph. This allows further analysis of how the relationship change over time.
ggplot(hiv_life, aes(x = Year, y = correlation)) +
  geom_point() +
  facet_wrap(~region, scales = "free_x") + # faceting by region allows data to be grouped by countries with more similar socio-economic conditions.
  geom_smooth() + 
  theme_bw() +
  labs(title = "Correlation of HIV Prevalence and Life Expectancy over Time",
       subtitle = "North America, Middle East & North Africa excluded due to insufficient samples",
       y = "Correlation Coefficient",
       x = "Year")

In most regions, HIV prevalence and life expectancy have been negatively correlated over time: the higher the HIV rate, the lower the life expectancy. However, this appears to hold only for less developed regions. In Europe & Central Asia, where living conditions are significantly better, the correlation coefficient between 1990 and 2000 is actually positive. Since this is counter-intuitive, a plausible explanation is that as living conditions improve, the HIV rate becomes much less relevant in determining life expectancy. This would also explain why Sub-Saharan Africa and South Asia, which saw strong economic development between 2005 and 2011, experienced significant decreases in the absolute value of the correlation coefficient.

4.2 What is the relationship between fertility rate and GDP per capita?

compiled_countries_data %>%
  filter(region != "North America") %>%
  ggplot(aes(x = gdpPerCap, y = fertility_rate)) +
  geom_point() +
  facet_wrap(~region, scales = "free") +
  theme_light() +
  geom_smooth() +
  labs(title = "Relationship between fertility rate and income varies by region",
       subtitle = "North America excluded due to insufficient samples",
       x = "GDP Per Capita",
       y = "Fertility rate (Births per woman)")

At lower income levels, there is a consistent trend that the higher the average income, the lower the fertility rate (negative correlation). Conversely, at mid-income levels, birth rates actually increase with GDP per capita for most countries (positive correlation), although the degree by which it increases varies greatly by region. Finally, at higher income levels, the correlation becomes negative again.

4.3 Which regions have the most observations with missing HIV data?

HIV_missing_data <- compiled_countries_data %>%
  group_by(region) %>%
  summarise(missing_entries = sum(is.na(HIV_rate_per_100)))

# Compare by number of missing entries
ggplot(HIV_missing_data, aes(x = missing_entries, y = reorder(region, missing_entries))) +
  geom_col() +
  theme_economist() +
  labs(title = "Missing HIV data",
       subtitle = "By count of missing observations",
       y = "",
       x = "Number of missing entries") +
  NULL

HIV_missing_data_2 <- compiled_countries_data %>%
  group_by(region) %>%
  summarise(missing_entries = sum(is.na(HIV_rate_per_100)),
            count = n()) %>%
  mutate(percentage_missing = missing_entries/count)

# Compare by percentage of missing entries
ggplot(HIV_missing_data_2, aes(x = percentage_missing, y = reorder(region, percentage_missing))) +
  geom_col() +
  theme_economist() +
  scale_x_continuous(labels = scales::percent) +
  labs(title = "Missing HIV data",
       subtitle = "By percentage of missing observations",
       y = "",
       x = "Percentage") +
  NULL

East Asia & Pacific has the most observations with missing HIV data, by both count of missing entries and percentage of missing entries.

4.4 How has mortality rate for under 5 changed by region?

We will find the top 5 countries that have seen the greatest improvement, as well as those 5 countries where mortality rates have had the least improvement or even deterioration.

mortality_plot <- compiled_countries_data %>%
  group_by(region, Year) %>%
  summarise(mean_mortality_rate = mean(mortality_rate_under5, na.rm = TRUE))

ggplot(mortality_plot, aes(x = Year, y = mean_mortality_rate)) +
  geom_point() +
  facet_wrap(~region, scales = "free_x") +
  theme_bw() +
  labs(title = "Mortality rate decreased over time across all regions",
       subtitle = "most significantly in South Asia and Sub-Saharan Africa",
       y = "Mortality rate of children under 5 (per 1,000 live births)",
       x = "") +
    NULL

# Collect data for best performing countries
mortality_head <- compiled_countries_data %>%
  filter(Year %in% c(1990, 2011)) %>% # %in%, not ==, so both years are kept for every country
  select(region, country, Year, mortality_rate_under5) %>%
  pivot_wider(names_from = Year, values_from = mortality_rate_under5) %>%
  rename(start = 3,
         end = 4) %>%
  mutate(change = end - start) %>%
  drop_na(change) %>%
  arrange(region, change) %>% # arrange() has no na.rm argument; NAs were already dropped above
  group_by(region) %>%
  slice_head(n=5)

# Collect data for worst performing countries
mortality_tail <- compiled_countries_data %>%
  filter(Year %in% c(1990, 2011)) %>% # %in%, not ==, so both years are kept for every country
  select(region, country, Year, mortality_rate_under5) %>%
  pivot_wider(names_from = Year, values_from = mortality_rate_under5) %>%
  rename(start = 3,
         end = 4) %>%
  mutate(change = end - start) %>%
  drop_na(change) %>%
  arrange(region, change) %>% # arrange() has no na.rm argument; NAs were already dropped above
  group_by(region) %>%
  slice_tail(n=5)

# Combine the 2 sets of data above
mortality_top_and_bottom <- bind_rows(mortality_head, mortality_tail) %>%
  arrange(region, change) %>%
  select(region, country, change) %>%
  rename(change_in_mortality_rate = change)
  
mortality_top_and_bottom
## # A tibble: 64 x 3
## # Groups:   region [7]
##    region              country           change_in_mortality_rate
##    <chr>               <chr>                                <dbl>
##  1 East Asia & Pacific Timor-Leste                        -116.  
##  2 East Asia & Pacific Lao PDR                             -88.2 
##  3 East Asia & Pacific Mongolia                            -80.2 
##  4 East Asia & Pacific Cambodia                            -75.3 
##  5 East Asia & Pacific Myanmar                             -53.9 
##  6 East Asia & Pacific New Zealand                          -5.20
##  7 East Asia & Pacific Singapore                            -4.9 
##  8 East Asia & Pacific Australia                            -4.70
##  9 East Asia & Pacific Brunei Darussalam                    -3.5 
## 10 East Asia & Pacific Japan                                -3.10
## # … with 54 more rows

Here, a negative change in mortality rate implies an improvement. Hence, the 5 countries that improved the most are those with the most negative changes in mortality rate, and vice versa for the least improved. The table covers the 1990-2011 period. The North American region consists of only 2 countries, the USA and Canada, so both countries appear in both the most-improved and least-improved lists.

4.5 Is there a relationship between primary school enrollment and fertility rate?

compiled_countries_data %>%
  filter(region != "North America") %>%
  ggplot(aes(x = fertility_rate, y = school_enrollment_rate)) +
  geom_point() +
  facet_wrap(~region, scales = "free") +
  expand_limits(y = c(0,100)) +
  theme_light() +
  geom_smooth() +
  labs(title = "Varied relationships between fertility rate and primary school enrollment",
       subtitle = "North America excluded due to insufficient samples",
       x = "Fertility rate (Births per woman)",
       y = "Primary School Enrollment (%)")

Whilst the relationship between fertility rate and primary school enrollment varies by region, overall there seems to be a negative relationship between the two variables in less developed regions (Latin America & Caribbean, Middle East & North Africa, Sub-Saharan Africa, South Asia, and the countries with very high birth rates in East Asia & Pacific). The correlation, however, fades in more developed regions (Europe & Central Asia, and most of East Asia & Pacific). As in the HIV prevalence and life expectancy case, the most plausible explanation is that as living conditions improve, the fertility rate becomes much less relevant in determining primary school enrollment.

5 Challenge 1: CDC COVID-19 Public Use Data

Let us revisit the CDC COVID-19 Case Surveillance Data. There are well over 3 million entries of individual, de-identified patient data. Since this is a large file, I suggest you use vroom to load it and keep cache=TRUE in the chunk options.
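The read_csv fallback mentioned in the chunk below can also be written explicitly. A sketch of the pattern, demonstrated on a small in-memory CSV rather than the 380 MB download (the I() literal-data syntax assumes vroom >= 1.5 and readr >= 2.0; read_fast is a hypothetical helper name):

```r
library(vroom)
library(readr)

# Try vroom first; fall back to read_csv if vroom fails
read_fast <- function(input) {
  tryCatch(vroom::vroom(input, show_col_types = FALSE),
           error = function(e) readr::read_csv(input, show_col_types = FALSE))
}

# Demonstrated on a tiny in-memory CSV; the real call would pass the CDC url
small <- read_fast(I("sex,death_yn\nMale,Yes\nFemale,No\n"))
```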

# file contains 11 variables and 3.66m rows and is well over 380Mb. 
# It will take time to download

# URL link to CDC to download data
url <- "https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD"

covid_data <- vroom::vroom(url)%>% # If vroom::vroom(url) doesn't work, use read_csv(url)
  clean_names()

We will now show the death rate (%):

  • by age group, sex, and whether the patient had co-morbidities or not
  • by age group, sex, and whether the patient was admitted to an Intensive Care Unit (ICU) or not.
covid_plot1_data <- covid_data %>%
#remove all observations where relevant data are missing
  mutate_if(is.character, list(~na_if(., "Unknown"))) %>%
  mutate_if(is.character, list(~na_if(., "Missing"))) %>%
  mutate(sex = na_if(sex, "Other")) %>%   # to ensure sex response "Other" is dropped with other NAs
  drop_na(age_group, sex, medcond_yn, death_yn) %>% # drop NAs
  group_by(age_group, sex, medcond_yn, death_yn) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(age_group, sex, medcond_yn) %>%
  mutate(subgroup_total = sum(count)) %>%
  ungroup() %>%
  filter(death_yn == "Yes") %>%
  mutate(death_percentage = round(count/subgroup_total, digits = 3),
         medcond_yn = factor(medcond_yn, levels = c("Yes", "No"),
                             labels = c("With comorbidities", "Without comorbidities"))) # Rename facet labels
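The group_by / mutate(sum) / filter pattern used above computes each subgroup's death share; a simplified version on a toy table shows the mechanics:

```r
library(dplyr)

toy <- tibble(group    = c("A", "A", "B", "B"),
              death_yn = c("Yes", "No", "Yes", "No"),
              count    = c(1, 9, 3, 7))

toy %>%
  group_by(group) %>%
  mutate(subgroup_total = sum(count),                 # denominator per group
         death_percentage = count / subgroup_total) %>%
  filter(death_yn == "Yes")                           # keep only the death share
# group A: 1/10 = 0.1; group B: 3/10 = 0.3
```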

new <- theme_set(theme_bw())
theme_replace(plot.tag.position = c(0.95,0.01),
        plot.tag = element_text(size = 9)) # Custom format tag position and font size for "Source: CDC"

covid_plot1 <- ggplot(covid_plot1_data, aes(x = death_percentage, y = reorder(age_group, age_group))) +
  geom_col(fill = "#6666CC") +
  facet_grid(medcond_yn ~ sex) +
  coord_cartesian(xlim = c(0, 0.7), clip = "off") +
  scale_x_continuous(labels = scales::percent) +
  theme_set(new) +
  labs(title = "Covid death % by age group, sex and presence of co-morbidities",
       x = "",
       y = "",
       caption = "Source: CDC") +
  geom_text(aes(label = death_percentage*100), size = 3, hjust = -0.1) +
  NULL

ggsave("covid_plot1.png", width = 33.87, height = 18.15, units = "cm", dpi = 96) # ggsave to resize graph

new <- theme_set(theme_bw())
theme_replace(plot.tag.position = c(0.95,0.01),
        plot.tag = element_text(size = 9)) # Repeat custom format for 2nd graph

covid_plot2_data <- covid_data %>%
#remove all observations where relevant data are missing
  mutate_if(is.character, list(~na_if(., "Unknown"))) %>%
  mutate_if(is.character, list(~na_if(., "Missing"))) %>%
  mutate(sex = na_if(sex, "Other")) %>% # to ensure sex response "Other" is dropped with other NAs
  drop_na(age_group, sex, icu_yn, death_yn) %>% # drop NAs
  group_by(age_group, sex, icu_yn, death_yn) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(age_group, sex, icu_yn) %>%
  mutate(subgroup_total = sum(count)) %>%
  ungroup() %>%
  filter(death_yn == "Yes") %>%
  mutate(death_percentage = round(count/subgroup_total, digits = 3),
         icu_yn = factor(icu_yn, levels = c("Yes", "No"),
                         labels = c("Admitted to ICU", "No ICU"))) # Rename facet labels

covid_plot2 <- ggplot(covid_plot2_data, aes(x = death_percentage, y = reorder(age_group, age_group))) +
  geom_col(fill = "#FF9999") +
  facet_grid(icu_yn ~ sex) +
  coord_cartesian(xlim = c(0, 0.85), clip = "off") +
  scale_x_continuous(labels = scales::percent) +
  theme_set(new) +
  labs(title = "Covid death % by age group, sex and whether patient was admitted to ICU",
       x = "",
       y = "",
       caption = "Source: CDC") +
  geom_text(aes(label = death_percentage*100), size = 3, hjust = -0.1) +
  NULL

ggsave("covid_plot2.png", width = 33.87, height = 18.15, units = "cm", dpi = 96) # ggsave to resize graph

# display saved graphs
knitr::include_graphics(here::here("covid_plot1.png"), error = FALSE)

knitr::include_graphics(here::here("covid_plot2.png"), error = FALSE)

6 Challenge 2: Excess rentals in TfL bike sharing

We will look at TfL data on how many bikes were hired every single day. We can get the latest data by running the following

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2020-09-18T09%3A06%3A54/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20201004%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201004T192007Z&X-Amz-Expires=300&X-Amz-Signature=e5010ca348cffd3ce020fad043f1bea706a88ab7ac6ee28ad406448699ffc1ea&X-Amz-SignedHeaders=host]
##   Date: 2020-10-04 19:20
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 165 kB
## <ON DISK>  /var/folders/dw/7jxg1bl50110hhppf2gnfvkm0000gn/T//Rtmpb60ZYO/file121e1235697ad.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))
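One caveat with isoweek(): the first days of January can belong to ISO week 52 or 53 of the previous year, which matters when week is later used as a 1-53 axis:

```r
library(lubridate)

isoweek(ymd("2016-01-01"))  # 53: Friday 1 Jan 2016 still belongs to ISO week 53 of 2015
isoweek(ymd("2016-01-04"))  # 1:  the first Monday of 2016 starts ISO week 1
```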

We can easily create a facet grid that plots bikes hired by month and year.

We can clearly see that 2020 has not been a normal year, with especially low rental rates in May and June due to the COVID-19 lockdown.

We will investigate this further by looking at the percentage deviations from the long-term average, by both month and week. The long-term average is calculated with the mean so that all days in a given month carry the same weight.

# create monthly forecast based on historic averages
bike_exp <- bike %>% 
  filter(year %in% c(2015:2019)) %>% # exclude 2020 to avoid outliers
  group_by(month) %>% 
  summarise(expected_rentals = mean(bikes_hired)) # use mean to calculate expected rentals

# create monthly actuals
bike_avg <- bike %>% 
  filter(year >= "2015") %>% 
  group_by(month, year) %>% 
  summarise(actual_rentals = mean(bikes_hired)) %>% 
  left_join(y = bike_exp, join_by =  month) %>% # merge forecasts with actuals
  mutate(excess_rentals = actual_rentals - expected_rentals, # calculate excess bikes rental
         date_month = as.Date(paste0("2015-", match(month, month.abb),"-01"),"%Y-%m-%d")) # use a date format for graph. 2015 used for no reason, but wont show up

# Create graph
bike_month_surplus <- ggplot(bike_avg, aes(x = date_month, y = actual_rentals)) +
  # create two line graphs
  geom_line(aes(y = expected_rentals), color = "blue", size = 1) +
  geom_line(aes(y = actual_rentals), color = "black", size = 0.25) +
  # create ribbons for deviations: deficits vs. forecast in red, surpluses in green
  # (alpha sits outside aes() so it sets transparency directly instead of being mapped to a scale)
  geom_ribbon(aes(ymin = expected_rentals + pmin(excess_rentals, 0), ymax = expected_rentals, fill = "Deficit"), alpha = .2) + 
  geom_ribbon(aes(ymin = expected_rentals, ymax = expected_rentals + pmax(excess_rentals, 0), fill = "Surplus"), alpha = .2) + 
  scale_fill_manual(values = c("#CB454A","#7DCD85"), name = "Deficit vs. Surplus") +
  # facet wrap for year
  facet_wrap( ~year) +
  theme_bw(base_size = 15) +
  scale_x_date(date_labels = "%b", date_breaks  ="months")  + # only show month on x-axis
  # create custom theme
  theme(title = element_text(size=8),
        axis.text = element_text(size=8),
        axis.ticks = element_blank(),
        axis.title.x= element_blank(),
        panel.background  = element_rect(color="white", fill = "white"),
        panel.border = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major = element_line(color="grey", size = 0.1),
        strip.background = element_rect(color="white", fill="white"),
        strip.text = element_text(size=8),
        panel.grid = element_line(color = "#D3D3D3"),
        legend.position = "none") +
  labs(y = "Monthly bike rentals",
       caption = "Source:TfL, London Data Store") + 
  ggtitle(label = "Monthly changes in TfL bike rentals", 
          subtitle = "Change in monthly average shown in blue \n and calculated between 2015-2019") 
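The pmin()/pmax() calls in the ribbons split the deviation series into its negative and positive parts, so each ribbon only ever extends to one side of the forecast line. In isolation:

```r
excess <- c(-3, 5, -1, 2)

pmin(excess, 0)  # -3  0 -1  0  -> lower edge of the red "deficit" ribbon
pmax(excess, 0)  #  0  5  0  2  -> upper edge of the green "surplus" ribbon
```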

# Save graph as custom format
ggsave("TfL_month.png",
       plot = last_plot(),
       scale = 1,
       width = 25,
       height = 20,
       units = "cm",
       dpi = 300,
       limitsize = TRUE)

knitr::include_graphics(here::here("TfL_month.png"), error = FALSE)

# Calculate weekly forecasts as long term weekly average
bike_exp_week <- bike %>% 
  filter(year %in% c(2015:2019)) %>% # exclude 2020 to avoid outliers
  group_by(week) %>% 
  summarise(expected_rentals = mean(bikes_hired))

# Create weekly actuals, merge forecast to actuals
bike_avg_week <- bike %>% 
  filter(year >= "2015") %>% 
  group_by(week, year) %>% 
  summarise(actual_rentals = mean(bikes_hired)) %>% 
  left_join(y = bike_exp_week, join_by =  week) %>% 
  # Calculate weekly deviation from long-term average
  mutate(excess_rentals = actual_rentals - expected_rentals,
         perc_change = excess_rentals / expected_rentals)

# Create graph
bike_week_deviation <- ggplot(bike_avg_week, aes(x = week, y = perc_change)) +
  # create quarter shading
  geom_rect(xmin=13, xmax=26, ymin=-1, ymax=Inf,fill = "grey",alpha=0.01) + 
  geom_rect(xmin=39, xmax=53, ymin=-1, ymax=Inf,fill = "grey",alpha=0.01) +
  # create deviation and forecast (y=0) lines
  geom_line(color = "black", size = 0.25) +
  geom_line(aes(y = 0), color = "black", size = 0.25) + 
  # Create ribbons to highlight deficits and surpluses in red and green respectively
  # (alpha sits outside aes() so it sets transparency directly instead of being mapped to a scale)
  geom_ribbon(aes(ymin = pmin(perc_change, 0), ymax = 0, fill = "Deficit"), alpha = .2) +
  geom_ribbon(aes(ymin = 0, ymax = pmax(perc_change, 0), fill = "Surplus"), alpha = .2) +
  scale_fill_manual(values=c("#CB454A","#7DCD85"), name="Deficit vs. Surplus") +
  # Create rug at bottom of the graphs
  geom_rug(aes(colour = ifelse(actual_rentals >= expected_rentals,">=0","<0")), sides = "b")+
  scale_colour_manual(values=c("#CB454A","#7DCD85"),name="Deficit vs. Surplus", guide=FALSE) +
  # facet wrap by year
  facet_wrap( ~year) +
  theme_bw(base_size = 15) +
  # custom x and y axis scales
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous (limits = c(0, 53),
                      breaks = c(0, 13, 26, 39, 53),
                      labels = c("", "13","26","39","53")) + 
  # Create custom theme
  theme(title = element_text(size=8),
        axis.text = element_text(size=8),
        axis.ticks = element_blank(),
        axis.title.y= element_blank(),
        panel.background  = element_rect(color="white", fill = "white"),
        panel.border = element_blank(),
        panel.grid.major = element_line(color="grey", size = 0.1),
        strip.background = element_rect(color="white", fill="white"),
        strip.text = element_text(size=8),
        panel.grid = element_line(color = "#D3D3D3"),
        legend.position = "none") +
  labs(x = "week",
       caption = "Source:TfL, London Data Store") +
  ggtitle(label = "Weekly changes in TfL bike rentals",
          subtitle = "% changes from weekly averages \n calculated between 2015 - 2019")

# Save graph as custom format
ggsave("TfL_week.png",
       plot = last_plot(),
       scale = 1,
       width = 25,
       height = 20,
       units = "cm",
       dpi = 300,
       limitsize = TRUE)

knitr::include_graphics(here::here("TfL_week.png"), error = FALSE)

7 Deliverables

As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

8 Details

  • Who did you collaborate with: Leif Beckers, Dung Tran, Salman Abdullah, Andjela Bozinovic, Xiwen Wang
  • Approximately how much time did you spend on this problem set: 15 h
  • What, if anything, gave you the most trouble: getting the graph to look closer to actual picture (re-scaling, etc)

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

9 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.